Abstract

"THIS PAPER IS ELIGIBLE FOR THE STUDENT PAPER AWARD"

By combining a bound on the absolute value of the difference 
of mutual information between two joint probability distributions 
with a fixed variational distance, and a bound on the probability 
of a maximal deviation in variational distance between a true 
joint probability distribution and an empirical joint probability 
distribution, confidence intervals for the mutual information 
of two random variables with finite alphabets are established. 
Unlike previous results, these intervals require no
assumptions on the distribution or the sample size.

I. Introduction 

In this paper confidence intervals for the mutual information 
of two random variables with finite alphabets are established. 
While they are not particularly tight, they are the first where 
no further restrictions have to be considered, neither on being 
in an asymptotic regime nor on the underlying joint probability 
distribution. By quantizing random variables with a non-finite
alphabet, it is also possible to find the lower bound of the
confidence interval of the mutual information of such random
variables. The simplicity of these confidence intervals also
makes it possible to give an upper bound on the necessary sample
size when the confidence interval width, the confidence level, and
the alphabet sizes are fixed.

II. Notational Setup

Let $(X, Y)$ and $(X', Y')$ be two pairs of finite discrete random
variables, with joint probability distributions

$$p_{XY} = \{p_{XY}(i,j) : i = 1,2,\ldots,M_x;\ j = 1,2,\ldots,M_y\},$$
$$p_{X'Y'} = \{p_{X'Y'}(i,j) : i = 1,2,\ldots,M_x;\ j = 1,2,\ldots,M_y\}.$$

Here $X, X' \in \mathcal{X}$ and $Y, Y' \in \mathcal{Y}$, and it is w.l.o.g. assumed
that $\mathcal{X} = \{1,2,\ldots,M_x\}$ and that $\mathcal{Y} = \{1,2,\ldots,M_y\}$.
The marginal probability distributions are $p_X = \{p_X(i) : i = 1,2,\ldots,M_x\}$,
$p_Y = \{p_Y(j) : j = 1,2,\ldots,M_y\}$, $p_{X'} = \{p_{X'}(i) : i = 1,2,\ldots,M_x\}$
and $p_{Y'} = \{p_{Y'}(j) : j = 1,2,\ldots,M_y\}$, where the marginals are
calculated from the joint probability distributions as usual. The Shannon
entropy [1] is defined as

$$H(X) = H(p_X) = -\sum_{i=1}^{M_x} p_X(i) \log p_X(i)$$

and the joint entropy [1] as

$$H(XY) = H(p_{XY}) = -\sum_{i=1}^{M_x} \sum_{j=1}^{M_y} p_{XY}(i,j) \log p_{XY}(i,j).$$

All logs are natural if not stated otherwise. $\mathcal{H}(\cdot)$ is defined as
the binary entropy function

$$\mathcal{H}(x) = -x \log x - (1 - x) \log(1 - x).$$

The mutual information [1] is defined as

$$I(X;Y) = I(p_{XY}) = H(X) + H(Y) - H(XY). \tag{1}$$
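As an illustration of definition (1), the following minimal Python sketch computes the entropies and the mutual information from a joint distribution stored as a nested list. The function names and the `base` parameter are assumptions made for this example and do not come from the paper.

```python
import math

def entropy(probs, base=math.e):
    """Shannon entropy of a probability vector; zero entries contribute 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

def mutual_information(joint, base=math.e):
    """I(X;Y) = H(X) + H(Y) - H(XY), with joint[i][j] = p_XY(i+1, j+1)."""
    p_x = [sum(row) for row in joint]            # marginal distribution of X
    p_y = [sum(col) for col in zip(*joint)]      # marginal distribution of Y
    p_xy = [p for row in joint for p in row]     # flattened joint distribution
    return entropy(p_x, base) + entropy(p_y, base) - entropy(p_xy, base)

# Independent variables carry (numerically almost) no mutual information.
independent = [[0.25, 0.25], [0.25, 0.25]]
print(mutual_information(independent))
```

Passing `base=2` gives the result in bits instead of nats.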

W.l.o.g. it is assumed that $M_x \le M_y$. This is possible
because the mutual information is symmetric ($I(X;Y) = I(Y;X)$),
so the variables can be renamed if necessary. The variational
distance between two probability distributions is defined as

$$V(p_{XY}, p_{X'Y'}) = \|p_{XY} - p_{X'Y'}\|_1 = \sum_{i=1}^{M_x} \sum_{j=1}^{M_y} |p_{XY}(i,j) - p_{X'Y'}(i,j)|,$$

and similarly for the marginal distributions. It can easily be
seen that $V(\cdot,\cdot) \in [0, 2]$ for any two probability distributions.
The empirical joint distribution for an i.i.d. sequence of pairs
$((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))$, sampled from a distribution
$p_{XY}$, is defined as

$$p_{X^nY^n} = \{p_{X^nY^n}(i,j) : i = 1,2,\ldots,M_x;\ j = 1,2,\ldots,M_y\},$$

where

$$p_{X^nY^n}(i,j) = \frac{1}{n} \sum_{k=1}^{n} \delta_{x_k i}\, \delta_{y_k j} \tag{2}$$

and $\delta_{\cdot\cdot}$ is the Kronecker delta.
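The empirical joint distribution (2) and the variational distance can be sketched in a few lines of Python. The toy sample and all function names below are hypothetical and serve only to illustrate the two definitions.

```python
from collections import Counter

def empirical_joint(pairs, m_x, m_y):
    """Empirical joint distribution (2): relative frequency of each pair
    (i, j), with symbols numbered 1..m_x and 1..m_y."""
    n = len(pairs)
    counts = Counter(pairs)
    return [[counts[(i, j)] / n for j in range(1, m_y + 1)]
            for i in range(1, m_x + 1)]

def variational_distance(p, q):
    """V(p, q): sum of absolute differences over all entries (L1 norm)."""
    return sum(abs(a - b)
               for row_p, row_q in zip(p, q)
               for a, b in zip(row_p, row_q))

sample = [(1, 1), (1, 2), (2, 2), (2, 2)]    # hypothetical toy sample, n = 4
p_hat = empirical_joint(sample, 2, 2)        # [[0.25, 0.25], [0.0, 0.5]]
uniform = [[0.25, 0.25], [0.25, 0.25]]
print(variational_distance(p_hat, uniform))  # 0.5
```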



III. Related Work 

The following two bounds will be used to construct the 
confidence interval for mutual information and are stated here 
as two Lemmas. 

Lemma 1: Let $(X, Y)$ and $(X', Y')$ be two pairs of random
variables taking values on the same range, with joint probability
distributions $p_{XY}$ and $p_{X'Y'}$. Let

$$\epsilon = V(p_{XY}, p_{X'Y'}).$$

If $\epsilon \le 2 - \frac{2}{M_x M_y}$, then it holds that

$$|I(X;Y) - I(X';Y')| \le 3 \cdot \frac{\epsilon}{2} \log(M_x M_y - 1) + 3\mathcal{H}\left(\frac{\epsilon}{2}\right). \tag{3}$$

Lemma 2: For any $\epsilon > 0$,

$$\Pr\{V(p_{XY}, p_{X^nY^n}) > \epsilon\} \le (2^{M_x M_y} - 2)\, e^{-n\epsilon^2/2}. \tag{4}$$

The first bound was found by Zhang [2, Theorem 2]. In
the next section this bound will be slightly improved and
generalized for the usage here, using a result of Ho and Yeung
[3, Theorem 6]. The second bound was originally found by
Weissman et al. J3] Theorem 2.1] and slightly modified by Ho 
and Yeung [3, Lemma 3] to have no dependence on the true 
distribution. 
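To get a feeling for Lemma 2, the right-hand side of (4) can be evaluated directly. The sketch below uses a hypothetical function name and arbitrarily chosen example values ($\epsilon = 0.02$, binary alphabets); it only illustrates how quickly the bound decays with $n$.

```python
import math

def deviation_bound(n, eps, m_x, m_y):
    """Right-hand side of (4): (2^(Mx*My) - 2) * exp(-n * eps^2 / 2)."""
    return (2 ** (m_x * m_y) - 2) * math.exp(-n * eps ** 2 / 2)

# The bound is only informative once it drops below 1.
for n in (10**3, 10**4, 10**5):
    print(n, deviation_bound(n, 0.02, 2, 2))
```

For these values the bound is still larger than 1 at $n = 10^3$ and $n = 10^4$, and becomes very small at $n = 10^5$.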

IV. Results 
First, (3) is improved to yield:

Theorem 1: Let $(X, Y)$ and $(X', Y')$ be two pairs of random
variables taking values on the same range, with joint
probability distributions $p_{XY}$ and $p_{X'Y'}$ and $M_x \le M_y$. Fix
an $\epsilon > 0$. Let

$$V(p_{XY}, p_{X'Y'}) \le \epsilon.$$

Then it holds that

$$|I(X;Y) - I(X';Y')| \le
\begin{cases}
\frac{\epsilon}{2} \log[(M_x M_y - 1)(M_x - 1)(M_y - 1)] + 3\mathcal{H}\left(\frac{\epsilon}{2}\right) & \text{for } \epsilon \le 2 - \frac{2}{M_x}, \\
\log(M_x) & \text{for } \epsilon > 2 - \frac{2}{M_x}.
\end{cases} \tag{5}$$



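Proof: The proof largely follows the lines of the proof of
(3) in Zhang [2, Eq. (2)], but replaces the entropy difference
bound of Zhang [2, Eq. 4] by the corresponding bound in Ho
and Yeung [3, Theorem 6], which makes the new bound valid
for any $\epsilon$ and also for any $V(p_{XY}, p_{X'Y'}) \le \epsilon$ instead of
$V(p_{XY}, p_{X'Y'}) = \epsilon$. Beyond this, some slight changes in the
proof of Zhang lead to a tighter bound.
First it is shown that $V(p_X, p_{X'}) \le \epsilon$:

\begin{align}
\|p_X - p_{X'}\|_1 &= \sum_{i=1}^{M_x} |p_X(i) - p_{X'}(i)| \notag \\
&= \sum_{i=1}^{M_x} \Big| \sum_{j=1}^{M_y} \big(p_{XY}(i,j) - p_{X'Y'}(i,j)\big) \Big| \notag \\
&\le \sum_{i=1}^{M_x} \sum_{j=1}^{M_y} |p_{XY}(i,j) - p_{X'Y'}(i,j)| \notag \\
&= V(p_{XY}, p_{X'Y'}) \notag \\
&\le \epsilon. \notag
\end{align}

In an analogous way it can be shown that $V(p_Y, p_{Y'}) \le \epsilon$. Then

\begin{align}
|I(X;Y) - I(X';Y')| &= |H(X) + H(Y) - H(XY) - H(X') - H(Y') + H(X'Y')| \tag{6} \\
&\le |H(X) - H(X')| + |H(Y) - H(Y')| + |H(XY) - H(X'Y')| \notag \\
&\le \tfrac{\epsilon}{2} \log(M_x - 1) + \mathcal{H}(\tfrac{\epsilon}{2}) + \tfrac{\epsilon}{2} \log(M_y - 1) + \mathcal{H}(\tfrac{\epsilon}{2}) + \tfrac{\epsilon}{2} \log(M_x M_y - 1) + \mathcal{H}(\tfrac{\epsilon}{2}) \tag{7} \\
&= \tfrac{\epsilon}{2} \log[(M_x M_y - 1)(M_x - 1)(M_y - 1)] + 3\mathcal{H}(\tfrac{\epsilon}{2}). \notag
\end{align}

In (6), eq. (1) was used. In (7) the bound of Ho and Yeung [3,
Theorem 6] was applied to each of the three entropy differences.
This is permitted because of the assumption $M_x \le M_y$ and
therefore, by the assumption $\epsilon \le 2 - \frac{2}{M_x}$,

$$2 - \frac{2}{M_x M_y} \ge 2 - \frac{2}{M_y} \ge 2 - \frac{2}{M_x} \ge \epsilon.$$

For $\epsilon > 2 - \frac{2}{M_x}$ the well known bounds on mutual
information and entropy [1], $I(X;Y) \ge 0$ and $I(X;Y) \le H(X) \le \log M_x$,
are first used to show that

$$0 \le I(X;Y),\, I(X';Y') \le \log M_x,$$

which immediately implies

$$|I(X;Y) - I(X';Y')| \le \log M_x, \tag{8}$$

independent of $\epsilon$, which completes the proof. ■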
Remark: The absolute entropy difference bound of Ho
and Yeung [3, Theorem 6] could also be used to bound
$|I(X;Y) - I(X';Y')|$ in the case $\epsilon > 2 - \frac{2}{M_x}$, but here it
can easily be seen that $|I(X;Y) - I(X';Y')| \le |H(X) - H(X')| +
|H(Y) - H(Y')| + |H(XY) - H(X'Y')| \le \log M_x +
|H(Y) - H(Y')| + |H(XY) - H(X'Y')|$, where the right-hand side is
at least $\log M_x$, and
therefore the upper bound $\log M_x$ is tighter for $\epsilon > 2 - \frac{2}{M_x}$.
From this argumentation it can also be seen that the upper
bound for the case that $\epsilon$ is smaller than, but close to, $2 - \frac{2}{M_x}$ is
still greater than $\log M_x$, and could therefore be improved by
taking the minimum of this bound and $\log M_x$, but for the sake
of simplicity and applicability of this bound this improvement
has not been applied in Theorem 1. This shows that this bound
is only useful for sufficiently small $\epsilon$, since $\log M_x$ is a well
known and, in the context of confidence intervals, trivial bound.
Nevertheless (5) is everywhere tighter than (3), applicable
for any $\epsilon$, and the variational distance $V(p_{XY}, p_{X'Y'})$ has
only to be less than or equal to $\epsilon$ and not strictly equal to $\epsilon$ for
(5). Therefore Theorem 1 is an improvement of the bound
of Zhang (Lemma 1).
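The comparison made in the remark can be checked numerically. The following sketch evaluates bound (3), bound (5), and the clipped variant (minimum of (5) and $\log M_x$) mentioned above, in nats. The function names and the alphabet sizes $M_x = 2$, $M_y = 3$ are assumptions chosen for illustration, and the code assumes $2 \le M_x \le M_y$.

```python
import math

def binary_entropy(x):
    """Binary entropy in nats, with the convention 0 log 0 = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1 - x) * math.log(1 - x)

def zhang_bound(eps, m_x, m_y):
    """Bound (3); only stated for eps <= 2 - 2/(Mx*My)."""
    return 1.5 * eps * math.log(m_x * m_y - 1) + 3 * binary_entropy(eps / 2)

def theorem1_bound(eps, m_x, m_y):
    """Bound (5), assuming 2 <= Mx <= My."""
    if eps > 2 - 2 / m_x:
        return math.log(m_x)
    prod = (m_x * m_y - 1) * (m_x - 1) * (m_y - 1)
    return 0.5 * eps * math.log(prod) + 3 * binary_entropy(eps / 2)

def theorem1_bound_clipped(eps, m_x, m_y):
    """Improvement mentioned in the remark: never exceed log(Mx)."""
    return min(theorem1_bound(eps, m_x, m_y), math.log(m_x))

for eps in (0.01, 0.1, 0.5, 1.0):
    print(eps, zhang_bound(eps, 2, 3), theorem1_bound(eps, 2, 3),
          theorem1_bound_clipped(eps, 2, 3))
```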

Finally the confidence interval is constructed by a combination
of Theorem 1 and Lemma 2.

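Theorem 2: For any $\alpha \in (0, 1]$ and $M_x, M_y$ with $M_x \le M_y$
let (where ln is the natural logarithm)

$$\epsilon = \sqrt{\frac{2}{n} \ln \frac{2^{M_x M_y} - 2}{\alpha}}$$

and

$$\Delta I(\epsilon) =
\begin{cases}
\frac{\epsilon}{2} \log[(M_x M_y - 1)(M_x - 1)(M_y - 1)] + 3\mathcal{H}\left(\frac{\epsilon}{2}\right) & \text{for } \epsilon \le 2 - \frac{2}{M_x}, \\
\log(M_x) & \text{for } \epsilon > 2 - \frac{2}{M_x},
\end{cases}$$

then, for any two random variables $X$, $Y$ with true joint
probability distribution $p_{XY}$ and empirical joint probability
distribution $p_{X^nY^n}$, it holds that

$$\Pr\{I(p_{X^nY^n}) - \Delta I(\epsilon) \le I(p_{XY}) \le I(p_{X^nY^n}) + \Delta I(\epsilon)\} \ge 1 - \alpha.$$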

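Proof: Rewriting (4) as

$$\Pr\{V(p_{XY}, p_{X^nY^n}) \le \epsilon\} \ge 1 - (2^{M_x M_y} - 2)\, e^{-n\epsilon^2/2}, \tag{9}$$

and solving $1 - \alpha = 1 - (2^{M_x M_y} - 2)\, e^{-n\epsilon^2/2}$ yields (obviously
only the positive solution is of interest)

$$\epsilon = \sqrt{\frac{2}{n} \ln \frac{2^{M_x M_y} - 2}{\alpha}}.$$

Then it follows that

\begin{align}
1 - \alpha &\le \Pr\{V(p_{XY}, p_{X^nY^n}) \le \epsilon\} \notag \\
&\le \Pr\{|I(p_{X^nY^n}) - I(p_{XY})| \le \Delta I(\epsilon)\} \tag{10} \\
&= \Pr\{I(p_{X^nY^n}) - \Delta I(\epsilon) \le I(p_{XY}) \le I(p_{X^nY^n}) + \Delta I(\epsilon)\}, \notag
\end{align}

where (10) is an application of Theorem 1. ■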
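A minimal sketch of how the interval of Theorem 2 could be evaluated in practice is given below. The function names are assumptions; the mutual information and $\Delta I$ are computed in bits, while $\epsilon$ always uses the natural logarithm as stated in the theorem. The input is the empirical distribution reported for the first numerical example ($n = 10^5$, confidence level 0.95), and the code assumes $2 \le M_x \le M_y$.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero entries contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information_bits(joint):
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    p_xy = [p for row in joint for p in row]
    return entropy_bits(p_x) + entropy_bits(p_y) - entropy_bits(p_xy)

def theorem2_interval(joint_hat, n, alpha):
    """Confidence interval of Theorem 2 in bits, assuming 2 <= Mx <= My."""
    m_x, m_y = len(joint_hat), len(joint_hat[0])
    eps = math.sqrt(2 / n * math.log((2 ** (m_x * m_y) - 2) / alpha))
    if eps > 2 - 2 / m_x:
        delta = math.log2(m_x)                     # trivial second case of (5)
    else:
        h = entropy_bits([eps / 2, 1 - eps / 2])   # binary entropy of eps/2
        prod = (m_x * m_y - 1) * (m_x - 1) * (m_y - 1)
        delta = eps / 2 * math.log2(prod) + 3 * h
    i_hat = mutual_information_bits(joint_hat)
    return i_hat - delta, i_hat + delta

p_hat = [[0.44950, 0.05058], [0.04868, 0.45124]]
print(theorem2_interval(p_hat, n=10**5, alpha=0.05))
```

For this input the result should land close to the Theorem 2 row of Table I (about 0.3817 to 0.6850 bits).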
The next theorem gives an upper bound on the necessary
number of samples $n$ to achieve a given confidence interval
width at a given confidence level $1 - \alpha$.

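Theorem 3: For any $\alpha \in (0, 1]$, $M_x, M_y$ with $M_x \le M_y$,
and $\gamma \in (0, \log M_x)$ let $\epsilon$ be the minimum root of

$$\frac{\epsilon}{2} \log[(M_x M_y - 1)(M_x - 1)(M_y - 1)] + 3\mathcal{H}\left(\frac{\epsilon}{2}\right) = \gamma. \tag{11}$$

Then for ($\lceil \cdot \rceil$ is the ceiling operator)

$$n = \left\lceil \frac{2}{\epsilon^2} \ln \frac{2^{M_x M_y} - 2}{\alpha} \right\rceil$$

it holds that

$$\Pr\{I(p_{X^nY^n}) - \gamma \le I(p_{XY}) \le I(p_{X^nY^n}) + \gamma\} \ge 1 - \alpha.$$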



Proof: If $\gamma \ge \log M_x$ then the probability of being within
the bounds is trivially one, therefore $\gamma$ is restricted to be less
than $\log M_x$. Then obviously only the first part of (5),

$$\frac{\epsilon}{2} \log[(M_x M_y - 1)(M_x - 1)(M_y - 1)] + 3\mathcal{H}\left(\frac{\epsilon}{2}\right),$$

applies, where $\epsilon \le 2 - \frac{2}{M_x}$. It is easy to show that this term is
strictly increasing for $\epsilon \in (0, 2 - \frac{2}{M_x})$. Therefore there is only
one solution for $\epsilon \in (0, 2 - \frac{2}{M_x})$ of equation (11), which is just
the desired maximal variational distance between the true and
the empirical joint distribution. This $\epsilon$ is also the minimum
root as stated in the theorem. Then solving (9) for $n$, after the
substitution of $\Pr\{V(p_{XY}, p_{X^nY^n}) \le \epsilon\}$ by $1 - \alpha$, yields

$$n \ge \frac{2}{\epsilon^2} \ln \frac{2^{M_x M_y} - 2}{\alpha}$$

and therefore

$$n = \left\lceil \frac{2}{\epsilon^2} \ln \frac{2^{M_x M_y} - 2}{\alpha} \right\rceil$$

clearly suffices to guarantee

$$\Pr\{I(p_{X^nY^n}) - \gamma \le I(p_{XY}) \le I(p_{X^nY^n}) + \gamma\} \ge 1 - \alpha.$$
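Since the left-hand side of (11) is strictly increasing on the relevant interval, its root can be found by bisection. The sketch below is one possible way to do this, not a procedure taken from the paper. The function names and the iteration count are assumptions, $\gamma$ is measured in nats, and $2 \le M_x \le M_y$ is assumed.

```python
import math

def bound_first_case(eps, m_x, m_y):
    """Left-hand side of (11) in nats, for 0 < eps <= 2 - 2/Mx."""
    x = eps / 2
    h = -x * math.log(x) - (1 - x) * math.log(1 - x)
    prod = (m_x * m_y - 1) * (m_x - 1) * (m_y - 1)
    return x * math.log(prod) + 3 * h

def required_samples(gamma, alpha, m_x, m_y, iterations=200):
    """Sample size of Theorem 3 for half-width gamma (nats)."""
    if not 0 < gamma < math.log(m_x):
        raise ValueError("gamma must lie in (0, log Mx)")
    lo, hi = 0.0, 2 - 2 / m_x          # the root is bracketed by this interval
    for _ in range(iterations):        # bisection on the increasing function
        mid = (lo + hi) / 2
        if bound_first_case(mid, m_x, m_y) < gamma:
            lo = mid
        else:
            hi = mid
    eps = lo                           # slightly below the root: conservative
    return math.ceil(2 / eps ** 2 * math.log((2 ** (m_x * m_y) - 2) / alpha))

# Hypothetical request: half-width 0.1 nats at confidence level 0.95.
print(required_samples(0.1, 0.05, 2, 2))
```

Taking the lower end of the final bracket makes $\epsilon$ marginally too small, which can only increase the returned $n$.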



The next theorem is an improvement of Theorem 2 that uses
the entropy optimization procedures of [3, Theorems 2 and 3],
which depend on the actual empirical distribution, instead of
the worst case entropy difference bound [3, Theorem 6].

Theorem 4: For any $\alpha \in (0, 1]$ and $M_x, M_y$ with $M_x \le M_y$
let

$$\epsilon = \sqrt{\frac{2}{n} \ln \frac{2^{M_x M_y} - 2}{\alpha}}$$

and let

$$I_{\min} = \min_{p_X :\, V(p_{X^n}, p_X) \le \epsilon} H(X) + \min_{p_Y :\, V(p_{Y^n}, p_Y) \le \epsilon} H(Y) - \max_{p_{XY} :\, V(p_{X^nY^n}, p_{XY}) \le \epsilon} H(XY),$$

$$I_{\max} = \max_{p_X :\, V(p_{X^n}, p_X) \le \epsilon} H(X) + \max_{p_Y :\, V(p_{Y^n}, p_Y) \le \epsilon} H(Y) - \min_{p_{XY} :\, V(p_{X^nY^n}, p_{XY}) \le \epsilon} H(XY),$$

where the solutions for the entropy optimization problems are
given in [3, Theorems 2 and 3]. Then it holds that

$$\Pr\{I_{\min} \le I(p_{XY}) \le I_{\max}\} \ge 1 - \alpha.$$

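Proof: Since $V(p_{X^n}, p_X)$ as well as $V(p_{Y^n}, p_Y)$ are $\le
V(p_{X^nY^n}, p_{XY}) \le \epsilon$, as shown in the proof of Theorem 1, it
is obvious that

$$\min_{p_{XY} :\, V(p_{X^nY^n}, p_{XY}) \le \epsilon} I(p_{XY}) \ge I_{\min},$$

$$\max_{p_{XY} :\, V(p_{X^nY^n}, p_{XY}) \le \epsilon} I(p_{XY}) \le I_{\max}.$$

By the argumentation of the proof of Theorem 2 again

$$\epsilon = \sqrt{\frac{2}{n} \ln \frac{2^{M_x M_y} - 2}{\alpha}}$$

is fixed, and it follows that

\begin{align}
1 - \alpha &\le \Pr\{V(p_{XY}, p_{X^nY^n}) \le \epsilon\} \notag \\
&\le \Pr\{I_{\min} \le I(p_{XY}) \le I_{\max}\}. \notag
\end{align}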



V. Discussion 

Theorem 3 can be seen as an upper bound for $n$ (the
number of samples), which is tight when Theorem 2 is used
to determine the confidence interval. This is explained by the
fact that the absolute entropy difference bound that was used
to construct the confidence intervals is completely independent
of the actual empirical distribution $p_{X^nY^n}$. Also, by using
the entropy difference bounds, the dependence between the
entropies $H(X)$, $H(Y)$ and $H(XY)$ was ignored, since for
example the worst case distribution of $X$ is not necessarily the
marginal of the worst case joint distribution, which makes the
mutual information difference bound less tight again.

Taken together, one can see that there is much room left for
improvement. By this, $n$ of Theorem 3 is an upper bound on
the necessary sample size.

A first improvement of this situation was given in Theorem 4.

An approach for also making use of the dependence between
the entropies is given as a conjecture and only for two binary
random variables in [4].

Besides this, in the preprint [7], an algorithm for finding
the lower bound of the confidence interval for a binary and
an arbitrary finite random variable is given. This bound is
tight in terms of the maximal variational distance between the
empirical and the true joint distribution.

VI. Numerical Examples 

In this section the different possibilities for the construction
of the confidence intervals that have just been discussed
are compared in two numerical examples. In these particular
examples it can be seen that the lower bound conjectured in
[4] (called Method 1) matches the lower bound of preprint [7]
(called Method 2), which gives a further indication for the
correctness of at least the lower bound in [4] (though there is
still no proof available).

The following setup is used: A binary symmetric
channel (BSC) with input variable $X$ and output variable
$Y$ is given, where the bit error rate (BER) is
equal to 0.1 and the input probabilities are $p_X = \{0.5, 0.5\}$.
The joint probabilities therefore are

$$p_{XY}(1,1) = 0.45, \quad p_{XY}(1,2) = 0.05,$$
$$p_{XY}(2,1) = 0.05, \quad p_{XY}(2,2) = 0.45.$$

In this case the true mutual information is known to be

$$I(p_{XY}) = 1 - \mathcal{H}(0.1) \approx 0.53100$$
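The joint probabilities and the true mutual information of this setup can be reproduced with a short Python sketch. The function names are assumptions made for the example; logarithms are taken to base 2.

```python
import math

def bsc_joint(p_x1, ber):
    """Joint distribution of input X and output Y of a binary symmetric
    channel; p_x1 = Pr{X = 1}, ber = bit error rate."""
    p_x2 = 1 - p_x1
    return [[p_x1 * (1 - ber), p_x1 * ber],
            [p_x2 * ber, p_x2 * (1 - ber)]]

def mutual_information_bits(joint):
    def h(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return h(p_x) + h(p_y) - h([p for row in joint for p in row])

joint = bsc_joint(0.5, 0.1)
print(joint)
print(mutual_information_bits(joint))   # approximately 0.531 bits
```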



(unlike in the sections before, in this section all logs are to
the base 2). Then, taking $n = 10^5$ samples from $p_{XY}$ yielded
the following exemplary empirical distribution

$$p_{X^nY^n}(1,1) = 0.44950, \quad p_{X^nY^n}(1,2) = 0.05058,$$
$$p_{X^nY^n}(2,1) = 0.04868, \quad p_{X^nY^n}(2,2) = 0.45124.$$

Now fixing the confidence level $1 - \alpha = 0.95$, the previously described
methods can be used to estimate the confidence interval.
Before this is done, a good approximation to the best possible
confidence interval is determined, where the best possible interval
is defined as having minimal interval width. To this end, samples
of size $n$ are drawn $10^5$ times from $p_{XY}$, yielding an
exemplary empirical sampling cumulative distribution function
(cdf) of $I(p_{X^nY^n})$ (shown in Fig. 1), which should be a
sufficiently good approximation to the real sampling cdf of
$I(p_{X^nY^n})$, due to the high number of samples.
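The sampling experiment described above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the code used for the paper: the function names, the seed, and the quantile indexing are choices made here, and the demo uses much smaller values of $n$ and of the number of repetitions than the $10^5$ used in the text, so that it runs quickly.

```python
import math
import random
from collections import Counter

def mutual_information_bits(joint):
    def h(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return h(p_x) + h(p_y) - h([p for row in joint for p in row])

def sampled_mi_quantiles(joint, n, repetitions, alpha, seed=0):
    """Empirical alpha/2 and (1 - alpha/2) quantiles of the plug-in mutual
    information over repeated samples of size n drawn from joint."""
    rng = random.Random(seed)
    rows, cols = len(joint), len(joint[0])
    cells = [(i, j) for i in range(rows) for j in range(cols)]
    weights = [joint[i][j] for i, j in cells]
    values = []
    for _ in range(repetitions):
        counts = Counter(rng.choices(cells, weights=weights, k=n))
        p_hat = [[counts[(i, j)] / n for j in range(cols)] for i in range(rows)]
        values.append(mutual_information_bits(p_hat))
    values.sort()
    lower = values[int(alpha / 2 * repetitions)]
    upper = values[min(repetitions - 1, int((1 - alpha / 2) * repetitions))]
    return lower, upper

bsc = [[0.45, 0.05], [0.05, 0.45]]
print(sampled_mi_quantiles(bsc, n=2000, repetitions=200, alpha=0.05))
```

With the smaller sample size the resulting interval is naturally much wider than the one reported for $n = 10^5$.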



Fig. 1. Empirical sampling cdf of $I(p_{X^nY^n})$.

Then, since it can be seen from the empirical sampling cdf
of $I(p_{X^nY^n})$ that the sampling probability density function
(pdf) is close to being unimodal and symmetric, the approximation
to the smallest possible confidence interval is given by
the $\frac{\alpha}{2}$-quantile $\approx 0.52517$ and the $(1 - \frac{\alpha}{2})$-quantile $\approx 0.53699$
of the empirical sampling cdf of $I(p_{X^nY^n})$ (both marked in
Fig. 1).

In Table I the results of the two methods described in
Section IV (Theorems 2 and 4) and of Methods 1 and 2, applied
to $p_{X^nY^n}$, are given.

TABLE I

                             Confidence interval (bits)
Method                       Lower bound   Upper bound   Width
approximated best possible   0.52517       0.53699       0.01182
Theorem 2                    0.38170       0.68504       0.30334
Theorem 4                    0.51645       0.55091       0.03445
Method 1                     0.51666       0.55080       0.03414
Method 2                     0.51666

Here it can be seen that the bound of Theorem 2 does not depend
on the empirical distribution, which makes its confidence interval
considerably broader than those of the other methods. Besides this, one can
see that the improved methods (Methods 1 and 2 in Table I)
have nearly the same performance as Theorem 4. The situation
changes when a true distribution with small mutual
information is used (such a situation is prevalent in JS]). This
is shown in the following example, where a BSC is used with
BER = 0.2 and an unequally distributed input variable $X$ with
distribution $p_X = \{0.1, 0.9\}$. The joint probabilities therefore
are

$$p_{XY}(1,1) = 0.08, \quad p_{XY}(1,2) = 0.02,$$
$$p_{XY}(2,1) = 0.18, \quad p_{XY}(2,2) = 0.72.$$

Here the true mutual information is

$$I(p_{XY}) \approx 0.10482.$$

Again taking $n = 10^5$ samples from $p_{XY}$ yielded the following
exemplary empirical joint distribution

$$p_{X^nY^n}(1,1) = 0.07996, \quad p_{X^nY^n}(1,2) = 0.02023,$$
$$p_{X^nY^n}(2,1) = 0.18012, \quad p_{X^nY^n}(2,2) = 0.71969.$$

The sampling cdf of $I(p_{X^nY^n})$ in this case can be seen in
Fig. 2. The approximation to the smallest possible confidence
interval is determined by the same method as in the first
example. The results are given in Table II.

Fig. 2. Sampling cdf of $I(p_{X^nY^n})$ (horizontal axis $I(p_{X^nY^n})$, from 0.096 to 0.114).

TABLE II
                             Confidence interval (bits)
Method                       Lower bound   Upper bound   Width
approximated best possible   0.10143       0.10826       0.00683
Theorem 2                    -0.04743      0.25591       0.30334
Theorem 4                    0.05269       0.15721       0.10452
Method 1                     0.08679       0.12402       0.03723
Method 2                     0.08679